4 research outputs found
Learning and Improving Policies for Probabilistic Planning Problems
In this work, we study the problem of learning and improving policies for probabilistic planning problems. In the first part, we train neural network policies for probabilistic planning problems modeled as factored Markov decision processes (MDPs). The objective is to train problem-specific neural networks via supervised learning to imitate the action choices of expert planners. In the second part, we focus on online policy improvement, where we try to improve on a given base policy via online search. Because search trees for these problems tend to be huge, action branches must be pruned in practice, which can adversely affect policy improvement. We formalize this notion by introducing the choice function framework and establish sufficient conditions on the actions expanded in search trees for guaranteed policy improvement. In the third part, we draw attention to the fact that theoretical guarantees of policy improvement can fail when the ideal conditions assumed in theory do not hold in practice. We propose benchmark problems, baselines, and metrics to assess the empirical performance of online policy improvement algorithms. In the final part, we focus on approximation via state aggregation in MDPs and study the theoretical guarantees of several aggregation schemes.
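To make the aggregation idea concrete, here is a minimal Python/NumPy sketch of one simple aggregation scheme for a tabular MDP: concrete states in the same cluster are averaged uniformly into an abstract state, and the abstract model is then solved by value iteration. The uniform-averaging scheme and all names below are illustrative assumptions, not the thesis's exact constructions.

    import numpy as np

    def aggregate_mdp(P, R, clusters, n_abstract):
        """Build an abstract MDP by uniform averaging within clusters.
        P: (S, A, S) transition tensor, R: (S, A) rewards,
        clusters: length-S array mapping each state to its abstract state."""
        S, A, _ = P.shape
        P_abs = np.zeros((n_abstract, A, n_abstract))
        R_abs = np.zeros((n_abstract, A))
        counts = np.bincount(clusters, minlength=n_abstract)
        for s in range(S):
            w = 1.0 / counts[clusters[s]]  # uniform weight within the cluster
            for a in range(A):
                R_abs[clusters[s], a] += w * R[s, a]
                for s2 in range(S):
                    P_abs[clusters[s], a, clusters[s2]] += w * P[s, a, s2]
        return P_abs, R_abs

    def value_iteration(P, R, gamma=0.95, iters=500):
        """Solve the (abstract) MDP with Bellman optimality backups."""
        S, A, _ = P.shape
        V = np.zeros(S)
        for _ in range(iters):
            V = (R + gamma * (P @ V)).max(axis=1)
        return V

The quality of the resulting abstract value function depends entirely on how well the clustering respects the MDP's structure, which is exactly the kind of question the aggregation-scheme guarantees address.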
The Choice Function Framework for Online Policy Improvement
There are notable examples of online search improving over hand-coded or learned policies (e.g., AlphaZero) for sequential decision making. It is not clear, however, whether policy improvement is guaranteed for many of these approaches, even when given a perfect evaluation function and transition model. Indeed, simple counterexamples show that seemingly reasonable online search procedures can hurt performance compared to the original policy. To address this issue, we introduce the choice function framework for analyzing online search procedures for policy improvement. A choice function specifies the actions to be considered at every node of a search tree, with all other actions being pruned. Our main contribution is to give sufficient conditions for stationary and non-stationary choice functions to guarantee that the value achieved by online search is no worse than that of the original policy. In addition, we describe a general parametric class of choice functions that satisfy those conditions and present an illustrative use case of the framework's empirical utility.
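As a concrete illustration of the framework's central idea, the following Python sketch runs a depth-limited lookahead in which a choice function prunes every action except the base policy's choice and a few alternatives. Always keeping the base policy's action is one natural way to aim at the paper's sufficient conditions (the actual conditions are more general); the `step`, `all_actions`, and `value_of_pi` callbacks are assumed interfaces, not the paper's API.

    def choice_fn(state, base_policy, all_actions, k=3):
        """Keep the base policy's action plus the first k-1 alternatives in
        enumeration order (a stand-in for a real heuristic ranking); every
        other action branch is pruned from the search tree."""
        keep = [base_policy(state)]
        for a in all_actions(state):
            if a not in keep and len(keep) < k:
                keep.append(a)
        return keep

    def lookahead(state, depth, base_policy, all_actions, step,
                  value_of_pi, gamma=0.95):
        """Depth-limited expectimax restricted to the chosen actions;
        leaves are evaluated with the base policy's value function."""
        if depth == 0:
            return value_of_pi(state)
        best = float("-inf")
        for a in choice_fn(state, base_policy, all_actions):
            q = 0.0
            for prob, reward, next_state in step(state, a):  # outcomes of (s, a)
                q += prob * (reward + gamma * lookahead(
                    next_state, depth - 1, base_policy, all_actions,
                    step, value_of_pi, gamma))
            best = max(best, q)
        return best

With an exact value function and the base action always retained, this lookahead's value can never fall below the base policy's; the interesting failure cases arise when those ideal conditions are relaxed.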
Training Deep Reactive Policies for Probabilistic Planning Problems
State-of-the-art probabilistic planners typically apply lookahead search and reasoning at each step to make a decision. While this approach can enable high-quality decisions, it can be computationally expensive for problems that require fast decision making. In this paper, we investigate the potential for deep learning to replace search with fast reactive policies. We focus on supervised learning of deep reactive policies for probabilistic planning problems described in RDDL. A key challenge is to explore the large design space of network architectures and training methods, which was critical to prior deep learning successes. We investigate a number of choices in this space and conduct experiments across a set of benchmark problems. Our results show that effective deep reactive policies can be learned for many benchmark problems and that leveraging the planning problem description to define the network structure can be beneficial.
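For concreteness, here is a minimal PyTorch sketch of the supervised-imitation setup the abstract describes: a feed-forward policy trained by cross-entropy to match an expert planner's action choices. The flat feature encoding, network shape, and training details are illustrative assumptions; the paper explores a much larger design space, including architectures derived from the RDDL problem description.

    import torch
    import torch.nn as nn

    class ReactivePolicy(nn.Module):
        def __init__(self, n_state_fluents, n_actions, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_state_fluents, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, n_actions),  # logits over discrete actions
            )

        def forward(self, state_fluents):
            return self.net(state_fluents)

    def imitate(policy, expert_states, expert_actions, epochs=50, lr=1e-3):
        """Cross-entropy imitation of an expert planner's action choices.
        expert_states: (N, n_state_fluents) floats; expert_actions: (N,) longs."""
        opt = torch.optim.Adam(policy.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(policy(expert_states), expert_actions)
            loss.backward()
            opt.step()
        return policy

At decision time the trained network replaces search entirely: a single forward pass yields the action, which is what makes the policy "reactive".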
Hindsight Optimization for Probabilistic Planning with Factored Actions
Inspired by the success of the satisfiability approach to deterministic planning, we propose a novel framework for online stochastic planning that embeds the idea of hindsight optimization into a reduction to integer linear programming. In contrast to previous work using reductions or hindsight optimization, our formulation is general purpose, working with domain specifications over factored state and action spaces, and is thereby also scalable in principle to exponentially large action spaces. Our approach is competitive with state-of-the-art stochastic planners on challenging benchmark problems and sometimes exceeds their performance, especially on problems with large action spaces.
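The following Python sketch conveys the hindsight-optimization idea without the integer-linear-programming machinery: each candidate first action is scored by averaging the values of deterministic plans over a shared set of sampled futures. The `sample_determinization` and `solve_deterministic` callbacks are hypothetical stand-ins for the paper's ILP reduction over factored state and action spaces.

    def hindsight_action(state, actions, sample_determinization,
                         solve_deterministic, horizon=10, n_futures=30):
        """Return argmax_a (1/K) * sum_k V_det(state, a, future_k).
        Each future fixes all stochastic outcomes over the horizon in
        advance, turning the lookahead into deterministic planning."""
        futures = [sample_determinization(horizon) for _ in range(n_futures)]
        best_a, best_q = None, float("-inf")
        for a in actions:
            # Average deterministic plan value over the shared futures.
            q = sum(solve_deterministic(state, a, f) for f in futures) / n_futures
            if q > best_q:
                best_a, best_q = a, q
        return best_a

Sharing the same sampled futures across all candidate actions (common random numbers) reduces the variance of the comparison; the factored ILP formulation matters because it lets the argmax range over exponentially many joint actions without enumerating them as this sketch does.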